Bayesian Reinforcement Learning with Gaussian Process Temporal Difference Methods
Authors
Abstract
Reinforcement Learning is a class of problems frequently encountered by both biological and artificial agents. An important algorithmic component of many Reinforcement Learning solution methods is the estimation of state or state-action values of a fixed policy controlling a Markov decision process (MDP), a task known as policy evaluation. We present a novel Bayesian approach to policy evaluation in general state and action spaces, which employs statistical generative models for value functions via Gaussian processes (GPs). The posterior distribution based on a GP-based statistical model provides us with a value-function estimate, as well as a measure of the variance of that estimate, opening the way to a range of possibilities not previously available. We derive exact expressions for the posterior moments of the value GP, which admit both batch and recursive computations. A sequential kernel sparsification method allows us to derive efficient online algorithms for learning good approximations of the posterior moments. By allowing our algorithms to evaluate state-action values, we derive model-free algorithms based on Policy Iteration for improving policies, thus tackling the complete RL problem. A companion paper describes experiments conducted with the algorithms presented here.
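To make the batch form of the posterior computation concrete, below is a minimal sketch of GP-based value estimation under simplifying assumptions: a squared-exponential kernel, a single trajectory, and white observation noise (the deterministic-transition variant of the model); the recursive and sparsified versions the abstract refers to are not shown. The names `rbf_kernel` and `gptd_posterior` are illustrative, not taken from the paper.

```python
# Minimal batch sketch of GP temporal-difference value estimation.
# Assumptions (not from the paper's text): RBF kernel, one trajectory
# x_0..x_T with rewards r_0..r_{T-1}, and white noise sigma^2 I; the
# stochastic-MDP variant would use a correlated noise term instead.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel between two sets of state vectors (rows).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gptd_posterior(states, rewards, gamma=0.95, sigma=0.1, lengthscale=1.0):
    """Return callables for the posterior mean and variance of the value GP."""
    T = len(rewards)                             # states holds T + 1 rows
    K = rbf_kernel(states, states, lengthscale)
    # Temporal-difference operator: (H v)_t = v(x_t) - gamma * v(x_{t+1}).
    H = np.zeros((T, T + 1))
    H[np.arange(T), np.arange(T)] = 1.0
    H[np.arange(T), np.arange(T) + 1] = -gamma
    Q = H @ K @ H.T + (sigma ** 2) * np.eye(T)   # covariance of observed rewards
    alpha = np.linalg.solve(Q, rewards)

    def mean(x_query):
        k_star = rbf_kernel(x_query, states, lengthscale) @ H.T
        return k_star @ alpha

    def variance(x_query):
        k_star = rbf_kernel(x_query, states, lengthscale) @ H.T
        prior = np.diag(rbf_kernel(x_query, x_query, lengthscale))
        return prior - np.einsum('ij,jk,ik->i', k_star, np.linalg.inv(Q), k_star)

    return mean, variance

# Hypothetical usage on a toy 1-D trajectory:
# states = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
# rewards = np.random.randn(10)
# mean_fn, var_fn = gptd_posterior(states, rewards)
```

The variance returned alongside the mean is what distinguishes this Bayesian estimate from a point estimate of the value function; it is the kind of uncertainty information that can, for instance, inform exploration or stopping decisions during policy iteration.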
Similar Resources
Reinforcement learning with kernels and Gaussian processes
Kernel methods have become popular in many sub-fields of machine learning, with the exception of reinforcement learning; they facilitate rich representations and enable machine learning techniques to work in diverse input spaces. We describe a principled approach to the policy evaluation problem of reinforcement learning. We present a temporal-difference (TD) learning algorithm that uses kernel functions. Ou...
Bayesian Policy Gradient and Actor-Critic Algorithms
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large num...
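As a point of reference for the Monte-Carlo gradient estimate mentioned above, here is a minimal sketch of a conventional REINFORCE-style update with a linear softmax policy; it is not the Bayesian policy gradient method of the cited paper, and all names (`softmax_policy`, `reinforce_update`) are hypothetical.

```python
# Illustrative Monte-Carlo (REINFORCE-style) policy-gradient update: the kind
# of high-variance estimator that Bayesian policy gradient methods aim to
# improve upon.  Toy linear softmax policy over a discrete action set.
import numpy as np

def softmax_policy(theta, phi):
    # theta: (feature_dim, num_actions); phi: (feature_dim,) state features.
    logits = phi @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(theta, episodes, gamma=0.99, lr=0.01):
    """episodes: list of trajectories, each a list of (phi, action, reward)."""
    grad = np.zeros_like(theta)
    for traj in episodes:
        # Discounted returns-to-go, computed backwards through the trajectory.
        G, returns = 0.0, []
        for _, _, r in reversed(traj):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (phi, a, _), G_t in zip(traj, returns):
            p = softmax_policy(theta, phi)
            one_hot = np.eye(len(p))[a]
            # grad log pi(a|s) for a linear softmax policy.
            grad += np.outer(phi, one_hot - p) * G_t
    return theta + lr * grad / len(episodes)
```

Because each gradient estimate is built from sampled returns, its variance grows with trajectory length and reward noise, which is why a large number of episodes is typically required; reducing this sample cost is the motivation for the Bayesian treatment.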
Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
The Octopus arm is a highly versatile and complex limb. How the Octopus controls such a hyper-redundant arm (not to mention eight of them!) is as yet unknown. Robotic arms based on the same mechanical principles may render present day robotic arms obsolete. In this paper, we tackle this control problem using an online reinforcement learning algorithm, based on a Bayesian approach to policy eval...
Bayesian Multi-Task Reinforcement Learning
We consider the problem of multi-task reinforcement learning where the learner is provided with a set of tasks, for which only a small number of samples can be generated for any given policy. As the number of samples may not be enough to learn an accurate evaluation of the policy, it would be necessary to identify classes of tasks with similar structure and to learn them jointly. We consider th...
Publication date: 2007